Boot-Strapping Language Identifiers for Short Colloquial Postings

Authors

Moisés Goldszmidt

Marc Najork

Stelios Paparizos

Abstract

There is tremendous interest in mining the abundant user-generated content on the web. Many analysis techniques are language dependent and rely on accurate language identification as a building block. Although language identification has been studied before, prior research has focused on very ‘clean’, editorially managed corpora, on a limited number of languages, and on relatively large documents. These are not the characteristics of the content found in, say, Twitter or Facebook postings, which are short and riddled with vernacular. In this paper, we propose an automated, unsupervised, scalable solution based on publicly available data. To this end, we thoroughly evaluate the use of Wikipedia to build language identifiers for a large number of languages (52) and a large corpus, and we conduct a large-scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language (model) profile size, and number of languages tested. Then, we show the value of using Wikipedia to train a language identifier directly applicable to Twitter. Finally, we augment the language models and customize them to Twitter by combining our Wikipedia models with location information from tweets. This method provides massive amounts of automatically labeled data that act as a bootstrapping mechanism, which we empirically show boosts the accuracy of the models. With this work we provide a guide and a publicly available tool [1] to the mining community for language identification on web and social data.
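The abstract does not name the algorithms that were evaluated. As a rough illustration of what a profile-based language identifier looks like, the sketch below uses the classic rank-order character n-gram approach (often attributed to Cavnar and Trenkle), which is an assumption here and not necessarily the authors' method. The function names, the toy training strings, and the default profile size are all hypothetical; a real system would train on much larger text, such as Wikipedia articles per language.

```python
from collections import Counter


def ngram_profile(text, n_max=3, profile_size=300):
    """Return the most frequent character n-grams (lengths 1..n_max), best first."""
    text = " ".join(text.lower().split())  # lowercase and normalize whitespace
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:profile_size]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a max penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Toy training data (hypothetical); far too small for real use.
TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and then runs away",
    "es": "el rapido zorro marron salta sobre el perro perezoso y luego huye",
}
profiles = {lang: ngram_profile(sample) for lang, sample in TRAIN.items()}
```

Calling `identify("some short posting", profiles)` returns one of the keys of `profiles`. The `profile_size` argument corresponds loosely to the "language (model) profile size" that the abstract says is varied in the study; with short postings, few n-grams are available, which is one reason accuracy depends on document size.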


Similar resources

Bootstrapping Knowledge About Social Phenomena Using Simulation Models

There are considerable difficulties in the way of the development of useful and reliable simulation models of social phenomena, including that any simulation necessarily includes many assumptions that are not directly supported by evidence. Despite these difficulties, many still hope to develop quite general models of social phenomena. This paper argues that such hopes are ill-founded, in other...


A Study of Colloquial Language in Jalal Al-e-Ahmad’s Fictions

As the most prominent novelist in contemporary Persian prose, Jalal Ale-Ahmad has had great influence on Persian writers, insofar as many writers have followed his suit. Employment of colloquial language is the characteristic style of his fiction. What makes his different, however, is mainly the employment of colloquialism in a subtle, precise and accurate way. Due to the extensive use of collo...


Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval system is fundamentally based on: Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...


Bootstrapping Without the Boot

What: We like minimally supervised learning (bootstrapping). Let’s convert it to unsupervised learning (“strapping”). How: If the supervision is so minimal, let’s just guess it! Lots of guesses lots of classifiers. Try to predict which one looks plausible (!?!). We can learn to make such predictions. Results (on WSD): Performance actually goes up! (Unsupervised WSD for translational senses, Eng...




Publication date: 2013
